-
Notifications
You must be signed in to change notification settings - Fork 395
T7101: Add hardware watchdog support via systemd #4843
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: current
Are you sure you want to change the base?
Conversation
|
👍 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR adds hardware watchdog support to VyOS, allowing users to configure systemd's runtime watchdog functionality to monitor system health and trigger automatic reboots on hang/freeze conditions.
Key changes:
- New Python configuration module for hardware watchdog with configurable timeouts
- XML interface definition for CLI commands to enable/disable watchdog and set parameters
- Jinja2 template for systemd watchdog configuration
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/conf_mode/system_watchdog.py | Core configuration logic for watchdog including validation, module loading, and systemd configuration |
| interface-definitions/system_watchdog.xml.in | CLI interface definition with timeout parameters and module specification |
| data/templates/system/watchdog.conf.j2 | Systemd configuration template for watchdog runtime settings |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
b066fe4 to
8796ec3
Compare
|
Added a basic smoketest for the watchdog: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
Binary cannot build |
4dfe354 to
abf012c
Compare
|
The failed smoktest looks like a false positive: |
|
I've made all the requested changes. Let me know if anything more is required. A minor addition if desired might be to allow specify which |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| try: | ||
| modules_load_directory.mkdir(exist_ok=True) | ||
| modules_load_file.write_text(f"{module}\n") | ||
| except Exception as e: | ||
| print(f"Warning: Failed writing modules-load configuration: {e}") |
Copilot
AI
Nov 12, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using bare Exception in except clauses is too broad and could mask unexpected errors. Consider catching more specific exceptions (e.g., OSError, IOError) or letting unexpected exceptions propagate to be caught by the airbag error handler.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll leave this to VyOS team to decide.
| try: | ||
| call(f'modprobe {module}') | ||
| except Exception as e: | ||
| print(f"Warning: Could not load watchdog module '{module}': {e}") |
Copilot
AI
Nov 12, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using bare Exception in except clauses is too broad and could mask unexpected errors. Consider catching more specific exceptions related to process execution (e.g., OSError) or letting unexpected exceptions propagate to be caught by the airbag error handler.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll leave this to VyOS team to decide.
|
CI integration ❌ failed! Details
|
dmbaturin
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the idea and the CLI seems fine to me. I left a couple of questions/suggestions that I'd like to address before we merge it, in addition to concerns people already pointed out.
| <help>Kernel module to load for watchdog device (optional)</help> | ||
| <valueHelp> | ||
| <format>txt</format> | ||
| <description>Module name (e.g. 'softdog', 'iTCO_wdt', 'sp5100_tco')</description> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We generally prefer set-time validation over commit-time validation whenever possible, since it provides a faster feedback and the user doesn't have to commit to discover that some options are wrong. Two questions from this perspective:
- How likely is it that the module for the hardware watchdog actually present in the system will not be loaded? Can we fully automate that and either load the module for the watchdog that the system has or fail the commit if there's no module matching the watchdog?
- If we can't fully automate it, can we add an explicit list of available watchdog module for set-time validation?
| </leafNode> | ||
| <leafNode name="timeout"> | ||
| <properties> | ||
| <help>Watchdog timeout for runtime in seconds (1-86400)</help> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is 86400 the upper limit for those timeouts in the kernel or systemd? If it's not, I believe we shouldn't introduce an artificial limit — one person's constant is another person's variable, as Alan Perlis quipped.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The max value of watchdog timeout varies per watchdog. Eg: bcm2835_wdt has a max timeout of 16 seconds. The iTCO_wdt module on v1 hardware is around 30 seconds and as high as 10mins on V2 hardware.
I will investigate further, it is likely that the max value can be read from some modules once it has been loaded.
sarthurdev
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for quickly addressing the requested changes.
I think we just need to figure out how to handle the module validation, XML-based is always the preferred method. But we also should validate the specified module is a watchdog driver.
| watchdog_config_dir = Path('/run/systemd/system.conf.d') | ||
| watchdog_config_file = Path(watchdog_config_dir / 'watchdog.conf') | ||
| modules_load_directory = Path('/run/modules-load.d') | ||
| modules_load_file = Path(modules_load_directory / 'watchdog.conf') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| watchdog_config_dir = Path('/run/systemd/system.conf.d') | |
| watchdog_config_file = Path(watchdog_config_dir / 'watchdog.conf') | |
| modules_load_directory = Path('/run/modules-load.d') | |
| modules_load_file = Path(modules_load_directory / 'watchdog.conf') | |
| watchdog_config_dir = Path('/run/systemd/system.conf.d') | |
| watchdog_config_file = watchdog_config_dir / 'watchdog.conf' | |
| modules_load_directory = Path('/run/modules-load.d') | |
| modules_load_file = modules_load_directory / 'watchdog.conf' |
Don't need to encapsulate the file paths in another Path()
| # If a module is provided, verify it resolves via modprobe (dry-run) | ||
| if module: | ||
| # Dry-run modprobe (-n) in quiet mode (-q) verifies availability without loading | ||
| rc = run(f'modprobe -n -q {module}') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only issue here is that it doesn't verify the specified module is a valid watchdog driver.
As per @dmbaturin's comment, probably best to build a validator or list of watchdog modules enabled in the VyOS kernel.
Change summary
Add vyos support for having the hardware watchdog managed by systemd. Have systemd also use the hardware watchdog to control shutdown and reboot timeouts.
Types of changes
Related Task(s)
https://vyos.dev/T7101
Related PR(s)
vyos/vyos-documentation#1703 (feature documentation)
How to test / Smoketest result
Checklist: